MATH8670 Course Project Presentation¶

Presenter: Jeremy James

Date: 12/14/2021

Introduction¶

  • Who is Home Credit?
    • Business
    • Loan Challenges
  • Kaggle Competition
  • Goals
    • Build a model that effectively detects loans that will default while remaining explainable
    • Find features most predictive of a default

Data Exploration¶

Application Data¶

   SK_ID_CURR  TARGET  NAME_CONTRACT_TYPE  CODE_GENDER  FLAG_OWN_CAR  FLAG_OWN_REALTY  CNT_CHILDREN  AMT_INCOME_TOTAL  AMT_CREDIT  AMT_ANNUITY  ...  FLAG_DOCUMENT_18  FLAG_DOCUMENT_19  FLAG_DOCUMENT_20  FLAG_DOCUMENT_21  AMT_REQ_CREDIT_BUREAU_HOUR  AMT_REQ_CREDIT_BUREAU_DAY  AMT_REQ_CREDIT_BUREAU_WEEK  AMT_REQ_CREDIT_BUREAU_MON  AMT_REQ_CREDIT_BUREAU_QRT  AMT_REQ_CREDIT_BUREAU_YEAR
0  100002      1       Cash loans          M            N             Y                0             202500.0          406597.5    24700.5      ...  0                 0                 0                 0                 0.0                         0.0                        0.0                         0.0                        0.0                        1.0
1  100003      0       Cash loans          F            N             N                0             270000.0          1293502.5   35698.5      ...  0                 0                 0                 0                 0.0                         0.0                        0.0                         0.0                        0.0                        0.0
2  100004      0       Revolving loans     M            Y             Y                0             67500.0           135000.0    6750.0       ...  0                 0                 0                 0                 0.0                         0.0                        0.0                         0.0                        0.0                        0.0
3  100006      0       Cash loans          F            N             Y                0             135000.0          312682.5    29686.5      ...  0                 0                 0                 0                 NaN                         NaN                        NaN                         NaN                        NaN                        NaN
4  100007      0       Cash loans          M            N             Y                0             121500.0          513000.0    21865.5      ...  0                 0                 0                 0                 0.0                         0.0                        0.0                         0.0                        0.0                        0.0

Bureau Balance Data¶

   SK_ID_BUREAU  MONTHS_BALANCE  STATUS
0  5715448       0               C
1  5715448       -1              C
2  5715448       -2              C
3  5715448       -3              C
4  5715448       -4              C

Trends¶

  • Sweetviz Report
  • Target Variable Distribution

Data Engineering¶

  • Removed several columns with too many nulls
  • Used KNN imputation to replace remaining nulls
  • Merged bureau balance data into application dataset
  • Engineered several additional features
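
The steps above can be sketched on toy data as follows. This is a minimal illustration, not the project's actual pipeline: the 50% null cutoff, the `MOSTLY_NULL` column, the `BB_MONTHS` feature name, and the `bureau_link` table mapping bureau records to applicants are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy application data; MOSTLY_NULL is a made-up column for illustration.
app = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003, 100004, 100006],
    "AMT_INCOME_TOTAL": [202500.0, 270000.0, 67500.0, 135000.0],
    "AMT_ANNUITY": [24700.5, np.nan, 6750.0, 29686.5],
    "MOSTLY_NULL": [np.nan, np.nan, np.nan, 1.0],
})

# 1. Drop columns with too many nulls (the 50% cutoff is an assumption).
MAX_NULL_FRACTION = 0.5
app = app.loc[:, app.isna().mean() <= MAX_NULL_FRACTION].copy()

# 2. KNN-impute the remaining nulls in the numeric feature columns.
feature_cols = ["AMT_INCOME_TOTAL", "AMT_ANNUITY"]
app[feature_cols] = KNNImputer(n_neighbors=2).fit_transform(app[feature_cols])

# 3. Summarize bureau balance per bureau record, then attach it to applicants.
bureau_balance = pd.DataFrame({
    "SK_ID_BUREAU": [5715448, 5715448, 5715448],
    "MONTHS_BALANCE": [0, -1, -2],
    "STATUS": ["C", "C", "C"],
})
# Hypothetical link table; the real applicant for this bureau record is unknown here.
bureau_link = pd.DataFrame({"SK_ID_BUREAU": [5715448], "SK_ID_CURR": [100002]})

# Number of monthly balance rows per bureau record.
months = bureau_balance.groupby("SK_ID_BUREAU").size().rename("BB_MONTHS").reset_index()
per_applicant = (
    bureau_link.merge(months, on="SK_ID_BUREAU")
    .groupby("SK_ID_CURR", as_index=False)["BB_MONTHS"].sum()
)
merged = app.merge(per_applicant, on="SK_ID_CURR", how="left")
# Applicants with no bureau history get 0 months (an assumption).
merged["BB_MONTHS"] = merged["BB_MONTHS"].fillna(0)
```

The left join keeps every applicant, so applicants without bureau balance history need an explicit fill value after the merge.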

Models¶

Weighted/Balanced Random Forest¶

  • Weighted method increases the penalty for misclassifying minority-class observations
  • Balanced method undersamples the majority class
  • From scikit-learn and imblearn, respectively
  • Both were shown to handle imbalanced data (thanks, Michael and Marcel)
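
The two ideas can be illustrated with a small sketch on synthetic data. It assumes scikit-learn's `class_weight="balanced"` option for the weighted forest, and it undersamples the majority class once by hand with NumPy as a simplified stand-in for the balanced forest (imblearn's version resamples for each tree). The data, sizes, and hyperparameters are invented for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)

# Synthetic imbalanced data: roughly 8% positives (illustrative only).
n = 2000
y = (rng.random(n) < 0.08).astype(int)
X = rng.normal(size=(n, 5)) + y[:, None] * 1.0

# Weighted approach: "balanced" weights are n_samples / (n_classes * class_count),
# so errors on the minority class cost more during training.
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
wrf = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=0
).fit(X, y)

# Balanced approach (simplified): keep all minority rows and an equally
# sized random subset of majority rows, then fit an ordinary forest.
def undersample_majority(X, y, rng):
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    keep_neg = rng.choice(neg, size=len(pos), replace=False)
    idx = np.concatenate([pos, keep_neg])
    return X[idx], y[idx]

X_bal, y_bal = undersample_majority(X, y, rng)
brf_like = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_bal, y_bal)
```

Weighting keeps all the data but changes the loss, while undersampling discards majority rows; which works better depends on the dataset.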

Explainable Boosting Machine with weights¶

  • Generalized additive model (GAM) with tree-ensemble feature functions and support for pairwise interactions
  • From interpretml
  • Supports sample weights, allowing it to be used for imbalanced data
  • Explainable
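
As a rough sketch of the model form described above, assuming the standard GAM-with-interactions formulation and a logit link for classification (the symbols below are notation chosen for this sketch, not taken from the slides):

```latex
% g is the link function (logit for a default / no-default target),
% f_j is the learned function of feature x_j,
% f_{ij} is a learned pairwise interaction term.
g\big(\mathbb{E}[y]\big) = \beta_0 + \sum_{j} f_j(x_j) + \sum_{(i,j)} f_{ij}(x_i, x_j)

% With per-observation weights w_n, training minimizes a weighted loss,
% so minority-class loans can be given larger w_n.
\mathcal{L} = \sum_{n} w_n \, \ell\big(y_n, F(x_n)\big),
\qquad F(x) = \beta_0 + \sum_{j} f_j(x_j) + \sum_{(i,j)} f_{ij}(x_i, x_j)
```

Because the prediction is a sum of per-feature terms, each term can be plotted or read off directly, which is what makes the model explainable.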

Results Table¶

   Model  Precision  Recall  Accuracy
2  EBM    0.67       0.16    0.70
1  BRF    0.66       0.16    0.69
0  WRF    0.00       0.00    0.92
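
For reference, the three metrics in the table can be computed from predictions as in the sketch below. The example data is invented: it assumes 8 defaults in 100 loans and a model that never predicts a default, which is one way (not necessarily the actual cause) to obtain a row like WRF's with zero precision and recall but high accuracy.

```python
def classification_metrics(y_true, y_pred):
    """Return (precision, recall, accuracy) for binary labels, 1 = default."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Convention used here: a ratio with an empty denominator is reported as 0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy

# Hypothetical data: 8 defaults out of 100 loans, model predicts "no default" for all.
y_true = [1] * 8 + [0] * 92
y_pred = [0] * 100
metrics = classification_metrics(y_true, y_pred)
```

On imbalanced data, accuracy alone rewards always predicting the majority class, which is why precision and recall are reported alongside it.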

Explanation¶

  • Used SHAP to understand Balanced Random Forest
  • Used out-of-the-box explanations for EBM

EBM Global Explanations¶

  • Find most important features
  • Understand the nature of the relationship between each independent variable and the default target
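
One common way to rank features in an additive model is the mean absolute contribution of each feature across the dataset. The sketch below assumes that definition and uses invented contribution values; it is not output from the fitted EBM.

```python
# Made-up per-loan contributions in log-odds units (rows = loans, columns = features).
feature_names = ["AMT_CREDIT", "AMT_INCOME_TOTAL", "CNT_CHILDREN"]
contributions = [
    [0.4, -0.1, 0.0],
    [-0.6, 0.2, 0.1],
    [0.5, -0.3, 0.0],
]

def global_importance(names, rows):
    """Mean absolute contribution per feature across all rows."""
    n = len(rows)
    return {name: sum(abs(r[j]) for r in rows) / n for j, name in enumerate(names)}

importance = global_importance(feature_names, contributions)
# Features sorted from most to least important.
ranking = sorted(importance, key=importance.get, reverse=True)
```

Taking absolute values matters: a feature that pushes some loans strongly toward default and others strongly away would otherwise average out to zero.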

Feature Importance¶

Education¶

EBM Local Explanation¶

  • Similar takeaways to the global explanations
  • Justify individual predictions
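
A local explanation for an additive model can be read as the list of per-feature contributions that sum, with the intercept, to the predicted log-odds. The sketch below assumes that structure; the intercept and score functions are hypothetical values invented for illustration, not the fitted model's.

```python
import math

# Hypothetical intercept and per-feature score functions (log-odds units).
INTERCEPT = -2.4
SCORE_FUNCTIONS = {
    "AMT_CREDIT": lambda v: 0.3 if v > 500_000 else -0.1,
    "AMT_INCOME_TOTAL": lambda v: -0.4 if v > 150_000 else 0.2,
}

def explain_local(row):
    """Return per-feature contributions, total log-odds, and default probability."""
    contributions = {name: f(row[name]) for name, f in SCORE_FUNCTIONS.items()}
    logit = INTERCEPT + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    return contributions, logit, probability

# Feature values taken from the first application row shown earlier.
row = {"AMT_CREDIT": 406597.5, "AMT_INCOME_TOTAL": 202500.0}
contributions, logit, probability = explain_local(row)
```

Each entry in `contributions` shows how far one feature moved this particular prediction toward or away from default, which is what justifies an individual decision.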

No Default Prediction¶

Default Prediction¶

Summary¶

  • Test scores are most predictive of loan defaults
  • Even with significant data engineering and modeling work, higher precision came at a significant cost in accuracy
  • Explainability doesn't have to mean worse performance